kscheppa4707@floridapoly.edu
For this project, I am continuing with my Airplane Crash Data Since 1908 dataset. I am excited to begin, because I can now carry out the analysis that I did not get to explore last time. I plan to explore the Operator column to find the operators with the most crashes, the Summary column to find the most common crash explanations, and the Long and Lat columns to see exactly where these crashes occurred.
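# Packages the code below relies on (inferred from the functions used)
library(tidyverse)  # dplyr, tidyr, ggplot2
library(tidytext)   # unnest_tokens(), stop_words
library(sf)         # st_as_sf(), st_intersects()
library(maps)       # state outlines
library(igraph)     # graph_from_data_frame()
library(ggraph)     # bigram network plot
library(plotly)     # ggplotly()
crashes <- read.csv("../data/airplanesData.csv")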
head(crashes)
summary(crashes)
ï..ID Date Time Location
Min. : 1 Length:4967 Length:4967 Length:4967
1st Qu.:1242 Class :character Class :character Class :character
Median :2484 Mode :character Mode :character Mode :character
Mean :2484
3rd Qu.:3726
Max. :4967
Lat Long Operator Flight..
Min. : -16 Min. :-176.66 Length:4967 Length:4967
1st Qu.: 19 1st Qu.: -98.61 Class :character Class :character
Median : 35 Median : -85.00 Mode :character Mode :character
Mean : 10222 Mean : -76.64
3rd Qu.: 41 3rd Qu.: -73.70
Max. :50626552 Max. : 174.12
Route AC.Type Registration cn.ln
Length:4967 Length:4967 Length:4967 Length:4967
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Aboard Aboard.Passangers Aboard.Crew Fatalities
Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
1st Qu.: 7.00 1st Qu.: 3.0 1st Qu.: 2.00 1st Qu.: 4.00
Median : 16.00 Median : 12.0 Median : 4.00 Median : 11.00
Mean : 31.09 Mean : 26.9 Mean : 4.48 Mean : 22.34
3rd Qu.: 35.00 3rd Qu.: 30.0 3rd Qu.: 6.00 3rd Qu.: 25.00
Max. :644.00 Max. :614.0 Max. :61.00 Max. :583.00
NA's :18 NA's :229 NA's :226 NA's :8
Fatalities.Passangers Fatalities.Crew Ground Summary
Min. : 0.00 Min. : 0.000 Min. : 0.000 Length:4967
1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 0.000 Class :character
Median : 8.00 Median : 3.000 Median : 0.000 Mode :character
Mean : 19.02 Mean : 3.579 Mean : 1.728
3rd Qu.: 21.00 3rd Qu.: 5.000 3rd Qu.: 0.000
Max. :560.00 Max. :46.000 Max. :2750.000
NA's :242 NA's :241 NA's :41
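The summary above reports a Lat maximum of 50626552 and a mean of 10222, so at least one latitude lies far outside the valid range of -90 to 90. A minimal sketch of a range check follows. The `toy` data frame and its values are hypothetical stand-ins for the real `crashes` columns, and whether out-of-range rows should be dropped or repaired is a judgment call this sketch does not settle.

```r
# Hypothetical toy rows standing in for crashes$Lat and crashes$Long
toy <- data.frame(
  Lat  = c(35.2, 50626552, NA, -16.0),
  Long = c(-85.0, -98.6, -73.7, 174.1)
)

# Keep rows whose coordinates are present and inside the valid ranges
valid <- !is.na(toy$Lat) & !is.na(toy$Long) &
  abs(toy$Lat) <= 90 & abs(toy$Long) <= 180

toy_clean <- toy[valid, ]
print(toy_clean)
```

The same logical vector could be applied to `crashes` before building any of the maps, so that a corrupted coordinate cannot stretch the bounding box.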
crashes <- separate(data = crashes,col = Date, into = c("Month", "Day", "Year"), sep = "\\/")
crashes <- separate(data = crashes, col = Location, into = c("City", "Region", "Country"), sep = "\\, ")
Expected 3 pieces. Additional pieces discarded in 9 rows [150, 663, 1722, 2369, 2601, 2849, 3139, 3707, 4848].
Expected 3 pieces. Missing pieces filled with `NA` in 4611 rows [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, ...].
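The two warnings above mean that some Location values have more than three comma-separated pieces and most have fewer. The sketch below shows how the `extra` and `fill` arguments of `tidyr::separate()` make that handling explicit. The `toy` locations are made up for illustration, and choosing `extra = "merge"` with `fill = "right"` is an assumption about what is wanted, not what the analysis above used.

```r
library(tidyr)

# Hypothetical locations with two, three, one, and four pieces
toy <- data.frame(Location = c(
  "Fort Myer, Virginia",
  "Victoria, British Columbia, Canada",
  "Over the North Sea",
  "Near Lima, Callao, Lima Province, Peru"
))

# extra = "merge" keeps surplus pieces in the last column instead of dropping them;
# fill = "right" pads short rows with NA on the right, without a warning
parts <- separate(toy, col = Location,
                  into = c("City", "Region", "Country"),
                  sep = ", ", extra = "merge", fill = "right")
print(parts)
```

With these settings a two-piece value such as "City, State" still puts the state in Region, which is what the later filter on Region relies on.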
The code below was used for the Crashes Per Operator Per Year graph.
# Operators with more than 50 crashes overall
small_DF <- crashes %>%
  group_by(Operator) %>%
  count(Operator) %>%
  arrange(desc(n)) %>%
  filter(n > 50)
# Crash count for every operator in every year
operatorData <- crashes %>%
  group_by(Operator, Year) %>%
  count(Operator) %>%
  arrange(desc(n)) %>%
  filter(n > 0)
operatorData
# Yearly counts restricted to the high-crash operators
df <- operatorData %>%
  filter(Operator %in% small_DF$Operator) %>%
  group_by(Year)
df
The below code was used for the Crash Count Per State graph.
my_sf <- st_as_sf(filter(crashes, Long < 0, Lat > 0), coords = c('Long', 'Lat'))
# maps::map is spelled out so it cannot be confused with purrr::map
state_map_data <- maps::map('state', fill = TRUE, plot = FALSE) %>% st_as_sf()
my_sf <- st_set_crs(my_sf, st_crs(state_map_data))
sf::sf_use_s2(FALSE)
Spherical geometry (s2) switched off
state_map_data$crash_count <- lengths(st_intersects(state_map_data,my_sf))
although coordinates are longitude/latitude, st_intersects assumes that they are planar
The below code was used for the Crash Location graph.
df_geom <- crashes %>%
  group_by(Year) %>%
  # The 48 contiguous states (Alaska and Hawaii are excluded)
  filter(Region %in% c("Alabama", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming")) %>%
  select(Year, Region, Long, Lat)
my_sf2 <- st_as_sf(filter(df_geom, Long < 0, Lat > 24), coords = c('Long', 'Lat'))
state_map_data2 <- maps::map('state', fill = TRUE, plot = FALSE) %>% st_as_sf()
my_sf2 <- st_set_crs(my_sf2, st_crs(state_map_data2))
sf::sf_use_s2(FALSE)
state_map_data$crash_count <- lengths(st_intersects(state_map_data,my_sf2))
although coordinates are longitude/latitude, st_intersects assumes that they are planar
The code below was used for the Crash Summary Bigram graph.
bigramData <- crashes %>%
  select(Summary)
bigramData
# Split each summary into overlapping two-word sequences
sum_bigram <- bigramData %>%
  unnest_tokens(bigram, Summary,
                token = "ngrams", n = 2)
sum_bigram %>%
  count(bigram, sort = TRUE)
# Drop bigrams in which either word is a stop word
bigrams_filtered <- sum_bigram %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_count <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_count
bigram_unite <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigram_unite
# Keep only pairs seen more than 60 times and build a directed graph
bigram_graph <- bigram_count %>%
  filter(n > 60) %>%
  graph_from_data_frame()
# `palette` is not a geom argument, so it is dropped here;
# a scale such as scale_color_brewer() would be the place to set colours.
# Grouping by Operator draws one line per operator.
f <- ggplot(data = df, aes(x = Year, y = n, color = Operator, group = Operator)) +
  geom_point() +
  geom_line() +
  labs(title = "Crashes Per Operator Per Year",
       y = "Number of crashes",
       x = "Year") +
  theme_minimal() +
  theme(plot.title.position = "plot")
ggplotly(f)
This graph shows the four operators with the most crashes. You can hover over a point to see the year, the number of crashes, and the operator. One thing I found interesting is that Deutsche Lufthansa made it into the top four even though its crashes all fall between 1908 and 1945. Something that may be worth looking at is a crash ratio, such as flights crashed to flights flown.
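A crash ratio of this kind could be computed as crashes divided by flights flown, scaled to a convenient base. The sketch below uses invented operators and invented flight totals, since the dataset has no flights-flown column; every number in it is hypothetical.

```r
# Hypothetical operators and counts, for illustration only
ops <- data.frame(
  Operator = c("Operator A", "Operator B"),
  crashes  = c(60, 55),
  flights  = c(1200000, 250000)
)

# Crashes per 100,000 flights flown
ops$crashes_per_100k <- ops$crashes / ops$flights * 1e5

# Rank operators by rate rather than by raw crash count
ops_ranked <- ops[order(-ops$crashes_per_100k), ]
print(ops_ranked)
```

In this made-up case the operator with fewer crashes has the higher rate, which is the kind of reversal a raw count cannot show.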
k <- ggplot() +
  geom_sf(data = state_map_data, aes(fill = crash_count)) +
  scale_fill_distiller(palette = "Reds", direction = 1) +
  labs(title = "Crash Count Per State") +
  # "none" replaces the deprecated FALSE for hiding a legend
  guides(fill = "none") +
  theme_minimal() +
  theme(plot.title.position = "plot")
ggplotly(k)
In this graph, we can see the number of crashes that occurred in each state.
p <- ggplot() +
  geom_sf(data = state_map_data) +
  geom_sf(data = my_sf2, color = 'red', size = 0.1) +
  theme_minimal() +
  labs(title = "Crash Location per State") +
  theme(plot.title.position = "plot")
ggplotly(p)
Building on the last graph, this one shows the location of each crash within each state.
set.seed(2021)
a <- grid::arrow(
  type = "closed",
  length = unit(.12, "inches")
)
ggraph(bigram_graph, layout = "kk") +
  geom_edge_link(
    aes(edge_alpha = n),
    show.legend = FALSE, arrow = a,
    edge_width = 1) +
  geom_node_point(color = "black",
                  size = 4) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Crash Summary Bigram") +
  theme_minimal() +
  theme(plot.title.position = "plot")
In this model, we can see the most common word pairings from the crash summaries. I thought this would be interesting to explore, as it could give some insight into what caused the crashes.
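To make the idea of a bigram concrete, here is a minimal base R sketch that pairs each word with the word that follows it and counts the pairs. The two `summaries` strings and the `bigrams_of` helper are hypothetical. The tokenization is a simplification and will not match `unnest_tokens()` exactly, and no stop words are removed.

```r
# Hypothetical crash summaries
summaries <- c("Engine failure on takeoff", "Crashed after engine failure")

# Return every pair of adjacent words in one piece of text
bigrams_of <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words != ""]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

all_bigrams <- unlist(lapply(summaries, bigrams_of))

# Count each pair, most frequent first
bigram_counts <- sort(table(all_bigrams), decreasing = TRUE)
print(bigram_counts)
```

Each arrow in the graph above corresponds to one such pair, pointing from the first word to the second.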
This analysis helped me see a different side of this data. It showed me the where and why that I missed in my first project. It was interesting to explore the per-state counts and the individual location of each crash. It was even a little surprising to find out that California had the most crashes in the continental US. As for the crash explanations, some were to be expected, like engine -> failure or weather -> conditions, but I thought it was interesting to see others like caught -> fire and struck -> tree.
I am really disappointed that I was not able to get ‘gganimate’ to work, and I hope to implement that in my future analysis. Additionally, I had a ‘leaflet’ plot that was really messy, so I hope to refine that and add that to a future version of this project.
date_df <- select(crashes, c(Month, Year, Fatalities, Aboard, AC.Type, Long, Lat))
#Changing the data type of fatalities to a numeric
date_df$Fatalities <- as.numeric(date_df$Fatalities)
#Changing all the NAs to 0s
date_df[is.na(date_df)] <- 0
#Changing the data type of year to a numeric
date_df$Year <- as.numeric(date_df$Year)
#Creating a new column in our dataset called decade
date_df <- date_df %>%
  mutate(Decade = floor(Year / 10) * 10)
# The state map must exist before its CRS can be copied
state_map_data3 <- maps::map('state', fill = TRUE, plot = FALSE) %>%
  st_as_sf()
international_sf <- st_as_sf(date_df, coords = c('Long', 'Lat'))
international_sf <- st_set_crs(international_sf, st_crs(state_map_data3))
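library(leaflet)
# A bare year is not a valid colour, so map years onto a palette first
year_pal <- colorNumeric("Reds", domain = international_sf$Year)
# Group names are passed as text; the layer control lists each one once
year_groups <- as.character(international_sf$Year)
leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = international_sf, fillColor = ~year_pal(Year), fillOpacity = 1, stroke = FALSE, radius = 5, group = year_groups) %>%
  addLayersControl(overlayGroups = unique(year_groups), options = layersControlOptions(collapsed = FALSE))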
bounding box has potentially an invalid value range for longlat data